Enable SME2 Streaming SVE in ARM #9126

Open
stevesuzuki-arm wants to merge 6 commits into halide:main from stevesuzuki-arm:pr-sme2
@stevesuzuki-arm
Contributor

@stevesuzuki-arm stevesuzuki-arm commented May 7, 2026

Enable SME2 Streaming SVE in ARM

This PR adds initial ARM SME2 streaming-mode support to Halide,
which lets pipelines compute with the longer streaming SVE vector length on targets with SME2.

A new sme_streaming(enable, var) scheduling directive lets users
control which loop is computed in streaming mode.

The change introduces a new Target::SME2 feature with supplemental features Target::SME_SVLDDD, where DDD is the streaming vector length in bits (e.g. 128, 256, 512, ...). If Target::SME2 is enabled, exactly one Target::SME_SVLDDD feature must be enabled as well.
natural_vector_size() now depends on whether the code is in streaming mode,
because the streaming vector length may differ from the non-streaming vector length.
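
To make the feature rule and the mode-dependent vector size concrete, here is a minimal self-contained sketch. It is not Halide's `Target` class: `ToyTarget`, its fields, and both free functions are hypothetical stand-ins, and treating an invalid feature set as "0 bits" is an assumption made only for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a target description (not Halide's real Target).
struct ToyTarget {
    bool sme2 = false;              // models Target::SME2
    int vector_bits = 0;            // non-streaming SVE vector length in bits
    std::vector<int> sme_svl_bits;  // one entry per enabled SME_SVLDDD feature (DDD)
};

// Streaming vector length in bits, or 0 when the feature set is invalid.
// The rule from the description: SME2 requires exactly one SME_SVLDDD feature.
int streaming_vector_bits(const ToyTarget &t) {
    if (!t.sme2 || t.sme_svl_bits.size() != 1) return 0;
    return t.sme_svl_bits[0];
}

// Number of lanes in a natural vector of the given element width.
// The answer changes with the mode because the two vector lengths may differ.
int natural_vector_size(const ToyTarget &t, int element_bits, bool in_streaming_mode) {
    const int bits = in_streaming_mode ? streaming_vector_bits(t) : t.vector_bits;
    return bits / element_bits;
}
```

Under these assumptions, a target with 128-bit non-streaming SVE and a 512-bit streaming vector length yields 4 lanes of 32-bit elements outside streaming mode and 16 lanes inside it.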

In Halide lowering, a new LowerSMEStreamingTasks pass
extracts each streaming-mode loop into an internal closure function,
so that we can attach the LLVM function attributes that transition into and out of streaming mode:

  • aarch64_pstate_sm_body to emit the smstart/smstop transitions
  • NoInline to prevent the streaming closure from being inlined into a non-streaming function
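
The outlining idea can be sketched with a toy model. Everything below is hypothetical and far simpler than Halide's IR: `ToyLoop`, `ToyFunc`, `lower_streaming_tasks`, and the closure naming scheme are invented for illustration, and the attributes are modelled as plain strings (`noinline` is the textual LLVM spelling of NoInline).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy loop: `streaming` stands in for a For node marked for streaming mode.
struct ToyLoop {
    std::string name;
    bool streaming;
};

// Toy lowered function with the attributes that would be attached in LLVM.
struct ToyFunc {
    std::string name;
    std::vector<std::string> llvm_attrs;
    std::vector<std::string> calls;  // closures this function invokes
};

// Outline every streaming loop into its own closure function and tag it.
// Element 0 of the result is the parent; the rest are the closures.
std::vector<ToyFunc> lower_streaming_tasks(const std::string &parent,
                                           const std::vector<ToyLoop> &loops) {
    std::vector<ToyFunc> funcs;
    ToyFunc root;
    root.name = parent;
    funcs.push_back(root);
    for (const ToyLoop &loop : loops) {
        if (!loop.streaming) continue;  // non-streaming loops stay in the parent
        ToyFunc closure;
        closure.name = parent + "_sme_streaming_" + loop.name;  // invented naming
        closure.llvm_attrs.push_back("aarch64_pstate_sm_body");  // smstart/smstop in the body
        closure.llvm_attrs.push_back("noinline");                // keep the mode boundary intact
        funcs[0].calls.push_back(closure.name);
        funcs.push_back(closure);
    }
    return funcs;
}
```

The point of the separate function is that the mode switch is a per-function property: once the loop lives in its own non-inlinable function, the streaming attribute applies to exactly that loop and nothing else in the caller.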

In CodeGen, target_vscale() depends on whether the function is in streaming mode,
so it can vary within a Module, although it is constant within a Function.
In streaming mode, vector type code-gen and intrinsic selection are
based on Target::sme_streaming_vector_bits() (the streaming vscale).
Coverage is almost the same as the existing SVE2 code-gen;
SME2-specific instructions are not enabled yet.

Additionally, the following changes are implemented:

  • Auto-detect the SME2 and SME_SVLDDD target features on the host CPU
  • Fall back from streaming SVE when vectorization factors are not feasible
  • Scalarize gather/scatter in streaming mode, with a warning
  • Add runtime checks for mismatches between the runtime streaming vscale and the compile-time vscale
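
As an illustration of the last item, the runtime check amounts to comparing two vscale values, where vscale is the vector length in units of 128 bits. The sketch below is an assumption-laden model, not the PR's runtime code: `CheckResult`, `check_streaming_vscale`, and the message text are hypothetical, and how the runtime streaming vector length is actually obtained is left out.

```cpp
#include <cassert>
#include <string>

// SVE and SME vector lengths are multiples of 128 bits; vscale is that multiple.
int vscale_from_bits(int vector_bits) {
    return vector_bits / 128;
}

// Toy result type; a real runtime would report through its own error handlers.
struct CheckResult {
    bool ok;
    std::string message;
};

// Compare the streaming vscale seen at run time with the one compiled for.
CheckResult check_streaming_vscale(int runtime_svl_bits, int compile_time_svl_bits) {
    const int runtime_vscale = vscale_from_bits(runtime_svl_bits);
    const int compiled_vscale = vscale_from_bits(compile_time_svl_bits);
    CheckResult result;
    result.ok = (runtime_vscale == compiled_vscale);
    if (!result.ok) {
        result.message = "streaming vscale mismatch: runtime " +
                         std::to_string(runtime_vscale) + ", compiled for " +
                         std::to_string(compiled_vscale);
    }
    return result;
}
```

A check of this kind matters because code generated for a fixed streaming vscale has vector types and intrinsics chosen for that length, so running it on hardware with a different streaming vector length should fail loudly rather than compute wrong results.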

Checklist

  • Tests added or updated (not required for docs, CI config, or typo fixes)
  • Documentation updated (if public API changed)
  • Python bindings updated (if public API changed)

Added:
- Target::SME2 definition
- streaming_vector_bits in Target for SME2
- Auto-detect SME2 and streaming_vector_bits
- sme_streaming() scheduling directive in Func and Pipeline
- DeviceAPI::Host_SMEStreaming in IR "For"
- LowerSMEStreamingTasks pass to extract streaming closure
- Attribute in LoweredFunc for streaming closure
- LLVM Function attribute to control streaming mode
  - NoInline to prevent streaming closure from being inlined
  - "aarch64_pstate_sm_body" to emit smstart/smstop transition
- Disable gather/scatter in SME streaming mode

Tests:
- Add correctness/sme_streaming
- Run simd_op_check_sve2 in SME streaming mode
- Add test to assert runtime streaming vscale
@stevesuzuki-arm
Contributor Author

This PR is ready for review. I will touch on this in the dev meeting if I have a chance.

Reason:
  While vector_bits is used across multiple target architectures,
  streaming_vector_bits is aarch64 specific. So we choose to
  use Target::Feature rather than a new member for arbitrary bits.

- Removed Target::streaming_vector_bits member variable
- Added Feature::SME_SVL{128,256,512,1024,2048}
Revert the changes in halide_error_vscale_invalid to
avoid potentially breaking runtime changes,
because the streaming_vector_bits member variable has been removed.
@stevesuzuki-arm
Contributor Author

Based on the feedback in the dev meeting, streaming_vector_bits has been replaced with Feature::SME_SVL.

@codecov

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 62.41379% with 109 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@b6f2e8b). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/CodeGen_ARM.cpp | 54.25% | 34 Missing and 9 partials ⚠️ |
| src/Target.cpp | 51.68% | 23 Missing and 20 partials ⚠️ |
| src/LowerSMEStreamingTasks.cpp | 81.69% | 9 Missing and 4 partials ⚠️ |
| src/IRPrinter.cpp | 0.00% | 2 Missing and 1 partial ⚠️ |
| src/Lower.cpp | 72.72% | 1 Missing and 2 partials ⚠️ |
| src/Profiling.cpp | 0.00% | 0 Missing and 2 partials ⚠️ |
| src/DeviceInterface.cpp | 0.00% | 0 Missing and 1 partial ⚠️ |
| src/Target.h | 50.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9126   +/-   ##
=======================================
  Coverage        ?   69.80%           
=======================================
  Files           ?      256           
  Lines           ?    77960           
  Branches        ?    18617           
=======================================
  Hits            ?    54419           
  Misses          ?    17996           
  Partials        ?     5545           
